PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Cai, Zefan; Zhang, Yichi; Gao, Bofei; Liu, Yuliang; Li, Yucheng; Liu, Tianyu; Lu, Keming; Xiong, Wayne; Dong, Yue; Hu, Junjie; Xiao, Wen

Computer Science > Computation and Language

arXiv:2406.02069 (cs)

[Submitted on 4 Jun 2024 (v1), last revised 15 May 2025 (this version, v4)]

Abstract: In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long-context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, in which attention is scattered widely in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens (a.k.a. massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache to lower layers and less to higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations on the LongBench benchmark show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5-point absolute accuracy improvement on the TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100.0 accuracy.
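The layer-wise allocation described in the abstract (more cache in lower layers, less in higher ones) can be illustrated with a small sketch. The abstract does not specify the actual schedule, so the linear decay used here, the function name `pyramid_budgets`, and the `spread` parameter are all assumptions made for illustration, not the authors' implementation.

```python
def pyramid_budgets(num_layers: int, avg_budget: int, spread: float = 0.5) -> list[int]:
    """Return a per-layer KV cache budget that shrinks from layer 0 to the last layer.

    Hypothetical linear schedule (an assumption, not taken from the paper):
    the first layer keeps avg_budget * (1 + spread) entries, the last layer
    keeps avg_budget * (1 - spread), and layers in between are interpolated.
    """
    if num_layers < 1 or avg_budget < 1:
        raise ValueError("num_layers and avg_budget must be positive")
    if not 0.0 <= spread < 1.0:
        raise ValueError("spread must be in [0, 1)")
    if num_layers == 1:
        return [avg_budget]

    top = avg_budget * (1.0 + spread)      # budget for the lowest layer
    bottom = avg_budget * (1.0 - spread)   # budget for the highest layer
    step = (top - bottom) / (num_layers - 1)
    # Round to whole cache entries and keep at least one entry per layer.
    return [max(1, round(top - i * step)) for i in range(num_layers)]


# Example: 32 layers with an average of 128 retained entries per layer.
budgets = pyramid_budgets(num_layers=32, avg_budget=128, spread=0.5)
uniform = [128] * 32  # the uniform allocation that the abstract contrasts against
```

Because the schedule is symmetric around `avg_budget`, the total number of retained entries stays close to that of the uniform allocation; only the distribution across layers changes.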

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2406.02069 [cs.CL] (or arXiv:2406.02069v4 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2406.02069

Submission history

From: Zefan Cai
[v1] Tue, 4 Jun 2024 07:51:30 UTC (4,708 KB)
[v2] Sun, 16 Jun 2024 06:41:08 UTC (8,227 KB)
[v3] Thu, 3 Oct 2024 08:46:42 UTC (10,182 KB)
[v4] Thu, 15 May 2025 17:18:12 UTC (29,047 KB)
